1 Morning

1.1 FASTQ format and FastQC/MultiQC

NOTE: the example data here is an uncompressed fastq file. In general, you should ALWAYS store fastq files as gzipped compressed (.fastq.gz). We use the compressed one here to show the contrast in file size after zipping it, and to let us easily use simple tools like grep and head.

1.1.1 Initial setup.

  1. Let’s make a new folder to do today’s exercises in
mkdir ~/ngs
cd ~/ngs
  1. Create a soft link to the test fastq file we’ll be using.
ln -s /workshops/ric_workshop/common/3_NGS/test1.fastq

1.1.2 Checking FASTQ files

  1. Run a head on the file to see a FASTQ file
head test1.fastq
@HWI-D00758:53:C7MV4ANXX:7:1101:1203:2160 1:N:0:CAGGAC
CACTCTACCGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGAAAAAAAAACTTTT
+
</B//F<F/<<BBFFBBFFB//<FFFF//F<7/BBBBFFF<7<B//<7/F///</<B///BB<FF/BB<BBF/BFB<7B/7B##################
@HWI-D00758:53:C7MV4ANXX:7:1101:1337:2249 1:N:0:CAGGAC
GACTATGATGCCCTAGATGTTGCCAACAAGATTGGGATCATCTAAACTGAGTCCAGATGGCTAATTCTAAATATATACTTTTTTCACCATAAAAAAAAAA
+
BBBBBFF<FFFFFFFF<FFFFFFF/<FFFF/FFFFFBFF/F<FFFFFFFFFFFFFF/FFFBFBBFF/FFFFFFFFFFFFFFFFFFF//<FFFFFFFFFFF
@HWI-D00758:53:C7MV4ANXX:7:1101:1698:2127 1:N:0:CAGGAC
TGGGCTACATTTTCTTATAAAAGAACATTACTATACCCTTTATGAAACTAAAGGACTAAGGAGGATTTAGTAGTAAATTAAGAATAGAGAGCTTAATTGA
  1. Check the number of lines in the file. With a FASTQ file the number of lines is 4 times the number of sequences
wc -l test1.fastq
100000 test1.fastq
  1. Check the size of the FASTQ file (uncompressed)
ls -lhL test1.fastq
-rw-rw-r--. 1 gchlip2 rrc_shared 6.3M May 20 09:58 test1.fastq
  1. Compress the file and check the compressed size
gzip -c test1.fastq > test1.fastq.gz
ls -lh test1.fastq.gz
-rw-rw-r-- 1 gchlip2 gchlip2 1.2M Jun 17 12:13 test1.fastq.gz

1.1.3 Running FastQC

  1. Copy your SLURM template as a new script in the local directory
cp ~/slurm_template.sh run_fastqc.sh
  1. Edit the new script in nano:
nano run_fastqc.sh

To look like the following. You can also copy from /workshops/ric_workshop/common/3_NGS/run_fastqc.sh

#!/bin/sh
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load FastQC

fastqc test1.fastq.gz

Remember: Ctrl+O to save, Ctrl+X to exit from nano.

  1. Submit the script to the queue
sbatch run_fastqc.sh
  1. Once the job is down, download the report (test1_fastqc.html) and view in a browser, i.e. double click the file.

1.1.4 Running MultiQC

For this exercise, we will run FastQC on a set of FASTQ files and then use MultiQC to summarize the FastQC reports.

  1. Create a softlink for our example FASTQ files.
ln -s /workshops/ric_workshop/common/3_NGS/fastq_files
  1. Copy your SLURM template as a new script in the local directory
cp ~/slurm_template.sh run_multiqc.sh
  1. Edit (nano run_multiqc.sh) the SLURM script to look like the following. You can also copy from /workshops/ric_workshop/common/3_NGS/run_multiqc.sh
#!/bin/sh
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load FastQC

mkdir fastqc_out

fastqc -o fastqc_out fastq_files/*.fastq.gz

module load MultiQC

multiqc fastqc_out

Remember: Ctrl+O to save, Ctrl+X to exit from nano.

  1. Submit the script to the queue
sbatch run_multiqc.sh
  1. Once the job is down, download the multiqc_report.html

  2. Open the multiqc_report.html in your browser (Double click on the file)

1.2 Trimming

1.2.1 Look at adapter contamination

  • This data set is a snippet of a 3’ RNA-seq sample. This means we’ll see read-through into polyA tails, then into adapter sequence.
    • We’ll check the adapters manually, then trim them using cutadapt.
  • We also see a drop in sequence quality towards the end of the read. NOTE: You do not normally need to manually check for adapter contaimination. This exercise is just to show you what adapter “contamination” might look like.

If your SSH session closed and you had to login again, be sure that you are in your ngs directory

cd ~/ngs
  1. Make a softlink of the adapters FASTA file into your working directory
ln -s /workshops/ric_workshop/common/3_NGS/truseq.fa
  1. Take a peek at the adapters FASTA file. More information is available at https://support-docs.illumina.com/SHARE/adapter-sequences.htm
cat truseq.fa
>TruSeq_R1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>TruSeq_R2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
  1. The read 1 TruSeq adapater starts with AGATCGGAA so, we can do a quick visual inspection to see if the adapter is present.
# zcat -f will "cat" both compressed and uncompressed text files
zcat -f test1.fastq.gz | grep --color AGATCGGAA

1.2.2 Trimming with cutadapt

  • We can give the complete adapter sequence, cutadapt will trim as much as it finds.
  • We’ll also trim to a minimum Q30 quality score from the end of the reads.
  1. Copy your SLURM template as a new script in the local directory
cp ~/slurm_template.sh trim.sh
  1. Edit (nano trim.sh) the SLURM script to look like the following. This SLURM script will perform the following.
    1. Trim the sequence data for TruSeq adapters and a minimum Q30 score.
    2. Generate a FastQC report of the trimmed data
    3. Generate a MultiQC report of the original and trimmed data. We added an --ignore parameter to command to have it ignore the previous set of FastQC reports that we generated for the multiple files. By default, MultiQC will scan the directory and all subdirectories of the path given. We will also specify a different file name.

You can also copy from /workshops/ric_workshop/common/3_NGS/trim.sh

#!/bin/sh
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load cutadapt

cutadapt -a file:truseq.fa -q 30 -o test1_trimmed.fastq.gz test1.fastq.gz > test1_cutadapt.log

module load FastQC

fastqc test1_trimmed.fastq.gz

module load MultiQC

multiqc --ignore fastqc_out --filename trimmed_multiqc .

Remember: Ctrl+O to save, Ctrl+X to exit from nano.

  1. Submit the script to the queue
sbatch trim.sh
  1. Check the output file (trim.sh.o<number>)
This is cutadapt 4.2 with Python 3.10.4
Command line parameters: -a file:truseq.fa -q 30 -o test1_trimmed.fastq.gz test1.fastq.gz
Processing single-end reads on 1 core ...
Finished in 0.562 s (22.484 µs/read; 2.67 M reads/minute).

=== Summary ===

Total reads processed:                  25,000
Reads with adapters:                     1,689 (6.8%)
Reads written (passing filters):        25,000 (100.0%)

Total basepairs processed:     2,500,000 bp
Quality-trimmed:                 335,454 bp (13.4%)
Total written (filtered):      2,125,995 bp (85.0%)

=== Adapter TruSeq_R1 ===

Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA; Type: regular 3'; Length: 33; Trimmed: 1689 times

Minimum overlap: 3
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-33 bp: 3

Bases preceding removed adapters:
  A: 94.1%
  C: 2.3%
  G: 1.4%
  T: 2.0%
  none/other: 0.2%
WARNING:
    The adapter is preceded by 'A' extremely often.
    The provided adapter sequence could be incomplete at its 5' end.
    Ignore this warning when trimming primers.

Overview of removed sequences
length  count   expect  max.err error counts
3       183     390.6   0       183
4       98      97.7    0       98
5       40      24.4    0       40

...

73      6       0.0     3       2 1 3
74      17      0.0     3       5 7 4 1
75      1       0.0     3       1


=== Adapter TruSeq_R2 ===

Sequence: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT; Type: regular 3'; Length: 33; Trimmed: 0 times

WARNING:
    One or more of your adapter sequences may be incomplete.
    Please see the detailed output above.

1.2.3 View the FastQC and MultiQC reports

  • Download the test1_fastqc.html, test1_trimmed_fastqc.html and trimmed_multiqc.html files using SFTP. Once downloaded, open the reports – they should open in a web browser.

Raw data FastQC

Trimmed data FastQC

SIDEBAR

By default, the base unit for read counts in MultiQC is million (M). If you want to change to a different base unit, create a configuration file, e.g. my_report.conf, with settings for read_count_multiplier, read_count_prefix and read_count_desc. The following is an example configuration file that would set the base unit to thousands (k)

read_count_multiplier: 0.001
read_count_prefix: "k"
read_count_desc: "thousands"

Once you have created the configuration file, use the -c option of MultiQC to have it use the configuration file.

multiqc -c my_report.conf .

1.3 Genome alignment

If your SSH session closed and you had to login again, be sure that you are in your ngs directory

cd ~/ngs

1.3.1 Use an already-indexed genome that we’ve provided.

  • This is a BWA index. Other aligners like STAR, bowtie2, etc., will have their own index files.
  • mm39.fa is the fasta file of genomic sequences.
  • mm39.fa.fai is a fasta index file generated by samtools. We will discuss more in the afternoon.
  • The other files (.amb, .ann. .bwt, .pca. and .sa) are binary files generated by BWA during indexing.
ln -s /workshops/ric_workshop/common/3_NGS/mm39
ls -lhL mm39
total 8.1G
drwxrwsr-x 4 gchlip2 rrc_shared  512 Jun 10 20:06 Bisulfite_Genome
-rw-rw-r-- 1 gchlip2 rrc_shared 2.6G May 29 14:09 mm39.fa
-rw-rw-r-- 1 gchlip2 rrc_shared 5.8K May 29 17:00 mm39.fa.amb
-rw-rw-r-- 1 gchlip2 rrc_shared 5.1K May 29 17:00 mm39.fa.ann
-rw-rw-r-- 1 gchlip2 rrc_shared 2.6G May 29 17:00 mm39.fa.bwt
-rw-rw-r-- 1 gchlip2 rrc_shared 2.0K May 29 16:33 mm39.fa.fai
-rw-rw-r-- 1 gchlip2 rrc_shared 651M May 29 17:00 mm39.fa.pac
-rw-rw-r-- 1 gchlip2 rrc_shared 1.3G May 29 17:10 mm39.fa.sa
-rw-rw-r-- 1 gchlip2 rrc_shared 1.1G May 29 16:33 mm39.gtf

1.3.3 Run alignment using BWA MEM, save to a BAM file with samtools

  • We’ll discuss samtools and the SAM/BAM format after break
  • We need to write a SLURM script to submit this command.
  • Note that we’re skipping trimming. BWA MEM is a very sensitive aligner that will soft-clip read ends that don’t map, so we can skip trimming and the result will be pretty much the same.
  1. Copy your SLURM template script from your home directory
cp ~/slurm_template.sh mapping.sh

NOTE: you can add your fastq file names to the script ahead of time like this:

ls atac_subset*.fastq.gz >> mapping.sh
  1. Edit (nano mapping.sh) the SLURM script look like this (bwa mem and samtools commands should be on the same line).
    1. We are going to set the requested memory to 10 GB (#SBATCH --mem=10G). bwa-mem requires sufficient memory to load the genome index.
    2. We are going to Set the requested number of cores to 4 (#SBATCH --ntasks-per-node=4). bwa-mem can run some processes in parallel and can take advantage of multiple processing cores to run faster.

You can also copy from /workshops/ric_workshop/common/3_NGS/mapping.sh

#!/bin/sh
#SBATCH --mem=10G
#SBATCH --ntasks-per-node=4
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load BWA
module load SAMtools

bwa mem -t $SLURM_TASKS_PER_NODE mm39/mm39.fa atac_subset_R1.fastq.gz atac_subset_R2.fastq.gz | 
   samtools view -b - > atac_subset.bam

Remember: Ctrl+O to save, Ctrl+X to exit from nano.

  1. Submit the job
sbatch mapping.sh

1.4 SAM/BAM format and samtools

  • Prints SAM to STDOUT, so pipe through head or less.
  • By default, samtools doesn’t show the header lines.
  • Use -h flag to show header. Use -H to show just the header (no reads).

If your last step (genome alignment) did not finish yet, you can link to the bam file below:

ln -s /workshops/ric_workshop/common/3_NGS/atac_subset.bam

1.4.1 Basic usage (view a file)

  1. Load the SAMtools module

  2. First, one can use “view” to see the alignments.

samtools view atac_subset.bam | head
NB501189:35:HTJL2BGXY:1:11101:12775:1048    121 19  37684741    60  41M =   37684741    0   GGAGACGTTTAGGAACCACCGCAGGAGACCACATANTGAAC   EEEEEEA/AEEEEEEEEEEEEEAEEEEEEEE/EEE#AAAAA   NM:i:1  MD:Z:35T5   MQ:i:0  AS:i:39 XS:i:0
NB501189:35:HTJL2BGXY:1:11101:12775:1048    181 19  37684741    0   *   =   37684741    0   TNCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGCCGGACN   E#EE############################EEEEAAAA#   MC:Z:41M    MQ:i:60 AS:i:0  XS:i:0
NB501189:35:HTJL2BGXY:1:11101:12274:1048    121 1   98511468    0   41M =   98511468    0   GATGAAAGGATGGACCATCTTGAGACTGCCATATCNAGGGA   EEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE#AAAAA   NM:i:1  MD:Z:35C5   MQ:i:0  AS:i:39 XS:i:39
NB501189:35:HTJL2BGXY:1:11101:12274:1048    181 1   98511468    0   *   =   98511468    0   TNATNNGNNNNNNNNNNNNNNNNNNNNNNNNNGGGTATTCN   E#EE##E#########################EEEEAAAA#   MC:Z:41M    MQ:i:0  AS:i:0  XS:i:0
NB501189:35:HTJL2BGXY:1:11101:12088:1048    73  11  102174317   60  41M =   102174317   0   GAAAGNAAGAAAGAAAAATGATTAAGGAGGGGCCAGTGAGA   AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE   NM:i:1  MD:Z:5A35   MQ:i:0  AS:i:39 XS:i:20
NB501189:35:HTJL2BGXY:1:11101:12088:1048    133 11  102174317   0   *   =   102174317   0   NCCCATACCNNNNNNNNNNNNNNNNNNNNNNNNNTNNTCNT   #AAAAEEEE#########################E##EA#E   MC:Z:41M    MQ:i:60 AS:i:0  XS:i:0
NB501189:35:HTJL2BGXY:1:11101:2303:1049 121 2   35873055    60  41M =   35873055    0   GAGGATGTACTCAGTAACTATTGTATGTATGTATGNATGTA   AEEEEAEAEE<EAEEE6EE6EEEEE/EEEE/EEE/#A/AAA   NM:i:1  MD:Z:35T5   MQ:i:0  AS:i:39 XS:i:20
NB501189:35:HTJL2BGXY:1:11101:2303:1049 181 2   35873055    0   *   =   35873055    0   ANAGCNCNNNNNNNNNNNNNNNNNNNNNNNNNACCCTGACN   E#/EE#E#########################A/EEAAAA#   MC:Z:41M    MQ:i:60 AS:i:0  XS:i:0
NB501189:35:HTJL2BGXY:1:11101:16869:1049    73  3   39342697    0   41M =   39342697    0   CCCATNAACATACAAGAAGCATACAGAACTCCAAATAGACT   AAAAA#EEEEEEEEE6EEEEEEEEEEEEEEEAEEEEEEEEE   NM:i:1  MD:Z:5G35   MQ:i:0  AS:i:39 XS:i:39
NB501189:35:HTJL2BGXY:1:11101:16869:1049    133 3   39342697    0   *   =   39342697    0   NTATAGTAGNNNNNNNNNNNNNNNNNNNNNNNNNANTGTNT   #AAAAEEEE#########################E#/EE#E   MC:Z:41M    MQ:i:0  AS:i:0  XS:i:0
  1. With “-h” flag we see the header lines first. You can pipe to “less” instead to page through into the alignment section.
samtools view -h atac_subset.bam | head
@HD VN:1.5  SO:unsorted GO:query
@SQ SN:1    LN:195154279
@SQ SN:2    LN:181755017
@SQ SN:3    LN:159745316
@SQ SN:4    LN:156860686
@SQ SN:5    LN:151758149
@SQ SN:6    LN:149588044
@SQ SN:7    LN:144995196
@SQ SN:8    LN:130127694
@SQ SN:9    LN:124359700

1.4.2 Convert between SAM and BAM

  • If writing to a SAM file, use the -h flag so that you get the header also and the SAM file is complete.
  • SAM is plain text, so it’s a bigger file. In general, there’s no reason to EVER store SAM files (always use BAM).
  • SAM is a much bigger file.

NOTE: for a real data set, this should be run as a SLURM job.

samtools view -h atac_subset.bam > atac_subset.sam
ls -lhL atac_subset.bam
ls -lhL atac_subset.sam
-rw-rw-r-- 1 gchlip2 rrc_shared 9.1M May 30 15:39 atac_subset.bam
-rw-rw-r-- 1 gchlip2 gchlip2 41M Jun 17 12:13 atac_subset.sam

If you need to convert a SAM file to BAM file, just use the -b flag.

samtools view -b atac_subset.sam > atac_subset_converted.bam

1.5 SAM flags

  1. Use samtools flagstat to count reads matching different combinations of flags
samtools flagstat atac_subset.bam
200000 + 0 in total (QC-passed reads + QC-failed reads)
200000 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
197766 + 0 mapped (98.88% : N/A)
197766 + 0 primary mapped (98.88% : N/A)
200000 + 0 paired in sequencing
100000 + 0 read1
100000 + 0 read2
195340 + 0 properly paired (97.67% : N/A)
196560 + 0 with itself and mate mapped
1206 + 0 singletons (0.60% : N/A)
490 + 0 with mate mapped to a different chr
216 + 0 with mate mapped to a different chr (mapQ>=5)
  1. Get a list of possible SAM flags
samtools flags
About: Convert between textual and numeric flag representation
Usage: samtools flags FLAGS...

Each FLAGS argument is either an INT (in decimal/hexadecimal/octal) representing
a combination of the following numeric flag values, or a comma-separated string
NAME,...,NAME representing a combination of the following flag names:

   0x1     1  PAIRED         paired-end / multiple-segment sequencing technology
   0x2     2  PROPER_PAIR    each segment properly aligned according to aligner
   0x4     4  UNMAP          segment unmapped
   0x8     8  MUNMAP         next segment in the template unmapped
  0x10    16  REVERSE        SEQ is reverse complemented
  0x20    32  MREVERSE       SEQ of next segment in template is rev.complemented
  0x40    64  READ1          the first segment in the template
  0x80   128  READ2          the last segment in the template
 0x100   256  SECONDARY      secondary alignment
 0x200   512  QCFAIL         not passing quality controls or other filters
 0x400  1024  DUP            PCR or optical duplicate
 0x800  2048  SUPPLEMENTARY  supplementary alignment
  1. Check the meaning of flags.
    • First by number
samtools flags 4
0x4 4   UNMAP
  • Then by name(s)
samtools flags UNMAP,PAIRED
0x5 5   PAIRED,UNMAP
  1. Filter a SAM/BAM file based on flags
  • -F[flag] = print if flag is false
  • -f[flag] = print if a flag is true
    • Flag 4 means read unmamped. So -F4 means print flags that are not unmapped (i.e., that are mapped).
samtools view -F 4 atac_subset.bam | head
NB501189:35:HTJL2BGXY:1:11101:12775:1048    121 19  37684741    60  41M =   37684741    0   GGAGACGTTTAGGAACCACCGCAGGAGACCACATANTGAAC   EEEEEEA/AEEEEEEEEEEEEEAEEEEEEEE/EEE#AAAAA   NM:i:1  MD:Z:35T5   MQ:i:0  AS:i:39 XS:i:0
NB501189:35:HTJL2BGXY:1:11101:12274:1048    121 1   98511468    0   41M =   98511468    0   GATGAAAGGATGGACCATCTTGAGACTGCCATATCNAGGGA   EEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE#AAAAA   NM:i:1  MD:Z:35C5   MQ:i:0  AS:i:39 XS:i:39
NB501189:35:HTJL2BGXY:1:11101:12088:1048    73  11  102174317   60  41M =   102174317   0   GAAAGNAAGAAAGAAAAATGATTAAGGAGGGGCCAGTGAGA   AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE   NM:i:1  MD:Z:5A35   MQ:i:0  AS:i:39 XS:i:20
NB501189:35:HTJL2BGXY:1:11101:2303:1049 121 2   35873055    60  41M =   35873055    0   GAGGATGTACTCAGTAACTATTGTATGTATGTATGNATGTA   AEEEEAEAEE<EAEEE6EE6EEEEE/EEEE/EEE/#A/AAA   NM:i:1  MD:Z:35T5   MQ:i:0  AS:i:39 XS:i:20
NB501189:35:HTJL2BGXY:1:11101:16869:1049    73  3   39342697    0   41M =   39342697    0   CCCATNAACATACAAGAAGCATACAGAACTCCAAATAGACT   AAAAA#EEEEEEEEE6EEEEEEEEEEEEEEEAEEEEEEEEE   NM:i:1  MD:Z:5G35   MQ:i:0  AS:i:39 XS:i:39
NB501189:35:HTJL2BGXY:1:11101:4079:1049 73  13  101638401   60  41M =   101638401   0   GTTTCNGGCAGCCCTGTCTCCTGCCAAACTTGGGCACTTGG   AAAAA#EEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEE   NM:i:1  MD:Z:5A35   MQ:i:0  AS:i:39 XS:i:19
NB501189:35:HTJL2BGXY:1:11101:9468:1049 121 2   92996906    60  41M =   92996906    0   TGCCCGCTCCCTTCCCTCGGCTACTCTCTGCCTTTNTTGGA   /EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE#AAAAA   NM:i:1  MD:Z:35G5   MQ:i:0  AS:i:39 XS:i:0
NB501189:35:HTJL2BGXY:1:11101:2945:1049 73  7   79852260    60  41M =   79852260    0   GTTGGNGGCGCCTGCTCTGCCCAGGTCCCAGGGTCGATGGG   AAAAA#6EEEEEEEEEAEEEEEEAEAEEAEEEEEAEEEEEE   NM:i:1  MD:Z:5T35   MQ:i:0  AS:i:39 XS:i:19
NB501189:35:HTJL2BGXY:1:11101:17448:1049    73  X   12466241    46  41M =   12466241    0   GGGCTNCACAGAGTAAACCTTGTCTCGAAAAACCAAAAAGA   AAAAA#EEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEE   NM:i:1  MD:Z:5A35   MQ:i:0  AS:i:39 XS:i:31
NB501189:35:HTJL2BGXY:1:11101:8077:1049 73  1   183573501   60  41M =   183573501   0   GAGCANGAGTCTTCAATTCACGTTGTTTACAGATCATGGGA   AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE   NM:i:1  MD:Z:5T35   MQ:i:0  AS:i:39 XS:i:0

We can use samtools view with a filter to get unmapped reads from a BAM file, in case we want to investigate those sequences more. The -b option tells samtools to output as BAM (i.e., compressed), which is better if we’re saving into a new file.

NOTE: for a real data set, these should be run as a SLURM job.

samtools view -f 4 -b atac_subset.bam > atac_subset_unmapped.bam
samtools flagstat atac_subset_unmapped.bam
2234 + 0 in total (QC-passed reads + QC-failed reads)
2234 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
2234 + 0 paired in sequencing
698 + 0 read1
1536 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

We could also do this in one command using a pipe, and without making an intermediate file. The solitary dash (-) in the last command tells samtools to read the alignment data from STDIN rather than a specific file.

NOTE if piping from samtools view include either the -h (include headers) or -b (BAM output) in the view command. When reading from STDIN, the uncompressed output tends to be a bit faster.

samtools view -f 4 -h atac_subset.bam | samtools flagstat -
2234 + 0 in total (QC-passed reads + QC-failed reads)
2234 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
2234 + 0 paired in sequencing
698 + 0 read1
1536 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

1.5.1 Something a bit more sophisticated… (OPTIONAL/TAKE HOME)

In this example, we are going to look for reads in which both pairs are not mapped (fully unmapped). For these examples, we are just going to pipe the output to samtools flagstat to see the count statistics. However, one could save the output to a BAM file for further analyses.

Also, there are other options for filtering. For more details you can check the usage statment samtools view --help or check the documentation at https://www.htslib.org/doc/samtools-view.html.

  1. First find the proper flags.
samtools flags
About: Convert between textual and numeric flag representation
Usage: samtools flags FLAGS...

Each FLAGS argument is either an INT (in decimal/hexadecimal/octal) representing
a combination of the following numeric flag values, or a comma-separated string
NAME,...,NAME representing a combination of the following flag names:

   0x1     1  PAIRED         paired-end / multiple-segment sequencing technology
   0x2     2  PROPER_PAIR    each segment properly aligned according to aligner
   0x4     4  UNMAP          segment unmapped
   0x8     8  MUNMAP         next segment in the template unmapped
  0x10    16  REVERSE        SEQ is reverse complemented
  0x20    32  MREVERSE       SEQ of next segment in template is rev.complemented
  0x40    64  READ1          the first segment in the template
  0x80   128  READ2          the last segment in the template
 0x100   256  SECONDARY      secondary alignment
 0x200   512  QCFAIL         not passing quality controls or other filters
 0x400  1024  DUP            PCR or optical duplicate
 0x800  2048  SUPPLEMENTARY  supplementary alignment
  1. Find the proper bit value. We want unmapped (UNMAP) and mate is unmapped (MUNMAP)
samtools flags UNMAP,MUNMAP
0xc 12  UNMAP,MUNMAP
  1. Run the filter (via view) and the get the stats. We could save to a file, if we wanted.
samtools view -f 12 -h atac_subset.bam | samtools flagstat -
1028 + 0 in total (QC-passed reads + QC-failed reads)
1028 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
1028 + 0 paired in sequencing
514 + 0 read1
514 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
  1. We could also run the filter (via view) using the names of the flags
samtools view -f UNMAP,MUNMAP -h atac_subset.bam | samtools flagstat -
1028 + 0 in total (QC-passed reads + QC-failed reads)
1028 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
1028 + 0 paired in sequencing
514 + 0 read1
514 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Here is a final example of filtering for reads that are unmapped but, the mate is mapped. In this case, the -f option will require reads to have this flag (UNMAP) and the -F option will exclude any reads with this flag set (MUNMAP)

samtools view -f UNMAP -F MUNMAP -h atac_subset.bam | samtools flagstat -
1206 + 0 in total (QC-passed reads + QC-failed reads)
1206 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
1206 + 0 paired in sequencing
184 + 0 read1
1022 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

1.6 Sort and index BAM

1.6.1 Coordinate sort our BAM file.

  • Most downstream applications require a coordinate-sorted BAM file.
  • Once we’ve sorted, we can also index for random access

NOTE: for a real data set, this should be run as a SLURM job.

samtools sort atac_subset.bam > atac_subset_sort.bam
samtools index atac_subset_sort.bam

Now we can filter reads in specific regions of the genome by the genomic coordinates. For example, an arbitrary 100bp region in chr1:

samtools view atac_subset_sort.bam 1:24650681-24650781 | wc -l
63
samtools view atac_subset_sort.bam 1:24650681-24650781 | head
NB501189:35:HTJL2BGXY:1:11101:11619:6108    163 1   24650643    0   41M =   24650675    73  TTTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAA   AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE   NM:i:0  MD:Z:41 MC:Z:41M    MQ:i:0  AS:i:41 XS:i:41 XA:Z:MT,-10975,41M,0;
NB501189:35:HTJL2BGXY:1:11101:17117:6255    99  1   24650643    0   41M =   24650797    195 TTTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAA   AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE   NM:i:0  MD:Z:41 MC:Z:41M    MQ:i:0  AS:i:41 XS:i:41 XA:Z:MT,-10975,41M,0;
NB501189:35:HTJL2BGXY:1:11101:7288:10786    99  1   24650643    0   41M =   24650658    56  TTTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAA   AAAAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE   NM:i:0  MD:Z:41 MC:Z:41M    MQ:i:0  AS:i:41 XS:i:41 XA:Z:MT,-10975,41M,0;
NB501189:35:HTJL2BGXY:1:11101:10432:5614    83  1   24650644    0   41M =   24650620    -65 TTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAAT   EEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA   NM:i:0  MD:Z:41 MC:Z:41M    MQ:i:0  AS:i:41 XS:i:41 XA:Z:MT,+10974,41M,0;
NB501189:35:HTJL2BGXY:1:11101:22907:13094   83  1   24650644    0   41M =   24650619    -66 TTTAAATCTGTTTGGCGTAAGCAGATTGAGCTAGTTATAAT   EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA   NM:i:0  MD:Z:41 MC:Z:41M    MQ:i:0  AS:i:41 XS:i:41 XA:Z:MT,+10974,41M,0;
NB501189:35:HTJL2BGXY:1:11101:20582:14664   99  1   24650652    0   41M =   24650674    63  TGTTTGGCGTAAGCAGATTGAGCTAGTTATAATTATTCCTC   AAAAAEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEE   NM:i:0  MD:Z:41 MC:Z:41M    MQ:i:0  AS:i:41 XS:i:41 XA:Z:MT,-10966,41M,0;
NB501189:35:HTJL2BGXY:1:11101:26577:4445    99  1   24650653    0   41M =   24650798    186 GTTTGGCGTAAGCAGATTGAGCTAGTTATAATTATTCCTCA   AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE   NM:i:0  MD:Z:41 MC:Z:41M    MQ:i:0  AS:i:41 XS:i:41 XA:Z:MT,-10965,41M,0;
NB501189:35:HTJL2BGXY:1:11101:24672:10931   99  1   24650657    0   41M =   24650667    51  GGCGTAAGCAGATTGAGCTAGTTATAATTATTCCTCATAGG   6AAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE   NM:i:0  MD:Z:41 MC:Z:41M    MQ:i:0  AS:i:41 XS:i:41 XA:Z:MT,-10961,41M,0;
NB501189:35:HTJL2BGXY:1:11101:7288:10786    147 1   24650658    0   41M =   24650643    -56 GCGTAAGCAGATTGAGCTAGTTATAATTATTCCTCATAGGG   EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA   NM:i:0  MD:Z:41 MC:Z:41M    MQ:i:0  AS:i:41 XS:i:41 XA:Z:MT,+10960,41M,0;
NB501189:35:HTJL2BGXY:1:11101:7732:7456 163 1   24650666    0   41M =   24650675    50  AGATTGAGCTAGTTATAATTATTCCTCATAGGGAGAGAAGG   AAAAAEEEEEEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEE   NM:i:0  MD:Z:41 MC:Z:41M    MQ:i:0  AS:i:41 XS:i:41 XA:Z:MT,-10952,41M,0;

We can also pipe to flagstat, but need to include the headers for this so that SAM data is properly formatted.

samtools view -h atac_subset_sort.bam 1:24650681-24650781 | samtools flagstat -
63 + 0 in total (QC-passed reads + QC-failed reads)
63 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
63 + 0 mapped (100.00% : N/A)
63 + 0 primary mapped (100.00% : N/A)
63 + 0 paired in sequencing
33 + 0 read1
30 + 0 read2
63 + 0 properly paired (100.00% : N/A)
63 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

1.7 Remove PCR duplicates

  • We can remove apparent PCR duplicates to avoid biases in downstream analysis.
  • Important points to keep in mind:
    • This analysis assumes that reads that align to the exact same position are duplicates.
    • For PE data, it looks like the alignments of BOTH R1 and R2. So the inference is much more precise for PE data.
    • The BAM file MUST be coordinate-sorted first
  • We will need to rerun the samtools index on the new BAM file.

NOTE: for a real data set, this should be run as a SLURM job.

samtools rmdup atac_subset_sort.bam atac_subset_nodups.bam
...
[bam_rmdup_core] processing reference GL456394.1...
[bam_rmdup_core] processing reference GL456396.1...
[bam_rmdup_core] processing reference JH584292.1...
[bam_rmdup_core] processing reference JH584293.1...
[bam_rmdup_core] processing reference JH584294.1...
[bam_rmdup_core] processing reference JH584296.1...
[bam_rmdup_core] processing reference JH584298.1...
[bam_rmdup_core] processing reference JH584299.1...
[bam_rmdup_core] processing reference JH584304.1...
[bam_rmdup_core] 849 / 97918 = 0.0087 in library '  '
samtools flagstat atac_subset_sort.bam
samtools flagstat atac_subset_nodups.bam
200000 + 0 in total (QC-passed reads + QC-failed reads)
200000 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
197766 + 0 mapped (98.88% : N/A)
197766 + 0 primary mapped (98.88% : N/A)
200000 + 0 paired in sequencing
100000 + 0 read1
100000 + 0 read2
195340 + 0 properly paired (97.67% : N/A)
196560 + 0 with itself and mate mapped
1206 + 0 singletons (0.60% : N/A)
490 + 0 with mate mapped to a different chr
216 + 0 with mate mapped to a different chr (mapQ>=5)
198323 + 0 in total (QC-passed reads + QC-failed reads)
198323 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
196089 + 0 mapped (98.87% : N/A)
196089 + 0 primary mapped (98.87% : N/A)
198323 + 0 paired in sequencing
99163 + 0 read1
99160 + 0 read2
193665 + 0 properly paired (97.65% : N/A)
194883 + 0 with itself and mate mapped
1206 + 0 singletons (0.61% : N/A)
490 + 0 with mate mapped to a different chr
216 + 0 with mate mapped to a different chr (mapQ>=5)

2 Afternoon

2.1 Peak calling

2.2 Peak calling results

If your last step (peak calling) did not finish yet, you can link to the results below:

ln -s /workshops/ric_workshop/common/3_NGS/atac_peaks.narrowPeak
ln -s /workshops/ric_workshop/common/3_NGS/atac_summits.bed
ln -s /workshops/ric_workshop/common/3_NGS/atac_treat_pileup.bdg

2.2.1 Peak calls (bed/narrowPeak)

head atac_peaks.narrowPeak
1   3820136 3820336 atac_peak_1 31  .   4.25793 5.39176 3.14871 70
1   4855527 4856425 atac_peak_2 223 .   12.7738 25.5049 22.368  339
1   4877600 4878435 atac_peak_3 310 .   15.9672 34.4755 31.0459 402
1   4927683 4929170 atac_peak_4 325 .   16.4995 36.0233 32.5467 195
1   5153075 5153839 atac_peak_5 169 .   10.6448 19.871  16.9386 453
1   5505186 5505419 atac_peak_6 61  .   5.85466 8.56696 6.12342 130
1   5893249 5893470 atac_peak_7 50  .   5.32241 7.46641 5.08531 101
1   6284396 6285693 atac_peak_8 251 .   13.8383 28.4312 25.1955 364
1   6294317 6294587 atac_peak_9 83  .   6.91914 10.8771 8.31705 140
1   6332640 6332895 atac_peak_10    50  .   5.32241 7.46641 5.08531 67
head atac_summits.bed
1   3820206 3820207 atac_peak_1 3.14871
1   4855866 4855867 atac_peak_2 22.368
1   4878002 4878003 atac_peak_3 31.0459
1   4927878 4927879 atac_peak_4 32.5467
1   5153528 5153529 atac_peak_5 16.9386
1   5505316 5505317 atac_peak_6 6.12342
1   5893350 5893351 atac_peak_7 5.08531
1   6284760 6284761 atac_peak_8 25.1955
1   6294457 6294458 atac_peak_9 8.31705
1   6332707 6332708 atac_peak_10    5.08531
head atac_treat_pileup.bdg
1   0   3050248 0.00000
1   3050248 3050416 0.12170
1   3050416 3050448 0.24339
1   3050448 3050616 0.12170
1   3050616 3051065 0.00000
1   3051065 3051089 0.12170
1   3051089 3051163 0.24339
1   3051163 3051216 0.36509
1   3051216 3051265 0.48678
1   3051265 3051279 0.36509

2.2.2 Use bedtools to get sequences of narrow peak regions

For this exercise, we will just pipe the output to head to get a peek at the results. We have included the -name option to have bedtools report the name of the peaks in the bed file with the FASTA output. To save the getfasta output to a file you can use the -fo option of bedtools sort to specify an output file or redirect STDOUT to a file.

module load BEDTools
bedtools getfasta -name -fi mm39/mm39.fa -bed atac_peaks.narrowPeak | head
>atac_peak_1::1:3820136-3820336
CCTGAAAAAATGTTCAACATCCTTAATCATCAGGGAAATGCAAATCAAAACAACCCTGAGATTCCACCTCACACCAGTCAGAATGGCTAAGATCAAAAATTCAGGTGACAGCAGATGCTGGCGTGGATGTGGAGAAAGAGGAACACTCCTCCATTGTTGGTGGGATTGCAGGCTTGTACAACCACTCTGGAAATCAGTCT
>atac_peak_2::1:4855527-4856425
CAGTTTTCTAGTTGCACCCTGGGTTGCACATTGGCCAGCTGGAATGTAGAAAAAACACTGGGATTTTTATAACGACGATAGGCACTTGCTAAGCAGTTCAGTTTAAACGCAATTTTGGATGAATCGTCTGCAGTTCCCGAGTGTGTACCCTCCGCCTGCTTAGCATCCTTAACACTGGGCGCACGCGGAAGGGAGCCAGGCGCCCGCACGCGTCCATCGCCAGAGTGACGCGGCCCCTGCACTCCCCGCACCGCCGCCAAGCGTTTACCCGTTTTCTGGAGTTAGGACTGGGCTTCAGGTTGGCCAGGCTCACTCTCGGCAAGGACCGCAGCAGGTCCAGGCTGGTCCCGCAGCCGCGCGCCGTGCCAGCCATGACCCGCGGTGCTTCAAGGGCGCCCAGGAAGCGGCGTAGACTCCGCGTTGCTTTACGGCTCCGCCCGGCCCCTGTCGCGCTGCAGCGGCCGTACTGCCCCGCCCCTGTCGTGCTGCAGAGCGACCGCATCCGCGTCGGTAGGCTATGTCTTCCAGCTTCCAGGCTTGATGTCTTTTTACTTCGGGGAACTGCATCAACTTTTGTGAGATCATGCTTTATCCTCTGGCTTTGGTACCATTTTGTGAACGAAAACTGTTCTCTTCTGCCCTGTAATGACTGAACACGCTGTACGGTAAAAGTGGGGCCTCATGAAGCGAAAAGTAATCCAAAATGATGTGAACAGGTTCAAGCTTTGAGACTCTGGTAATTCAACTGTTACAGTGCCTGGCTCTAGAATTAATTGTCTAATTAGGTTCAACCTAATAGCGTGCCTTGCAGTTAACGATGGGGGGTGGGGCATACTTAGCTCCTGGGGTATGCCAGAAACTGAGTTAAAGTTAGCCTTACTAGATTCATATAAACAAC
>atac_peak_3::1:4877600-4878435
GGTCATGACATTTTGACCTTAAATGTTTGCCATGAACCGTTCCCCAGTTATACATACTATATATGCAATAAATTCAGCGCTGTGCAGAAGGGTGGCGATTTTGCCCCGTGTTCAAAACGGCTATATTTCAATGAATACCTGTCCTTTAGTTAATAACGTTGACCTGCCTGGGGAAGTGAAATCGGGAGTCACGCGTCAAGCGCGCAGGGGACCACAGCAAGAGCTCGCCCCGCCTAGCCGCGGTCCCAGCAGCGGTCCCTGCTACCTGCCGCTGGTCCAGGCACTGCGGGCGTGGATGGACAGCGAGAGGTTAGCAGCGGCCCGCACTGCTCCAGCGCGGTGCGGGAAGGCTCACTATGACGCGCACGCGCGGCCGAATCGGGGCGCGAGCTCGGGGCCGAACGCGAGGAACAGTGCAGGGCGGCGCGGGGGCGCGCACGCGCCTGAGCGCGCGCCCGGAGGGGCGGGCTGGGACTTTCGGCTGCCGGGAGCCCGAGTTCCCTTCCGCTTCCGACGCACTGTCCGCCAGCCGGTGGATGTGCGGCAACAACATGTCCGCTCCGATGCCCGCCGTTGTGCCGGCCGCCCGGAAGGCCACCGCCGCGGTGAGTGGCGGCGCCCGGGCCTGGGGCACCAGCGAGACCCAGGCCGCCGCGCGTGGATGCGTCCGCGCCGCCCCGCGGGCTCTCCGGGGTCGGGCGGGGAATCGGAGCCGGCGGGGCGCCGCCTGCCTGCCTAGCACTCTCACCCCCACCCCTGAAGCAAATTCCGAGGTCCAGTGCTGAAACAGCCGACCTGCAGCGGGAGTGTCTGTCCTGCCGCAGGAGCTCAGATT
>atac_peak_4::1:4927683-4929170
TCATGCGCTTTAAGCCCTCGGCAATGCCTGTCCTGCGTCCCAGAGAACGCTCTGCCGGAGGGGTTTCGATGGAACTCGTAGCAACCTACCGCCTACTGCCTGATCCCTCTGGCGTGAAAGCCGGACTCCGTCCAACTCCAGCTCGCCAGCAACGCGAGTCCGGATAGGGCCGGAAGTTCCAGACTGCTGGGGGCGGGAATAGATTAAAAGAACAGCGCACGCCTGAGCGAGCCACTTCTACTTTCCAGTCTCCTGCGATCGATTCGTAGTGCCTGTGTGGCGCGCGCGGTGCTTGACTGGCACGGCCTAGGGGGCGGAGCGCTTATCCCTGCCGCCGCGGGCCGGGTCTGTGAGGAAGGCCTAGGCCAGCGGCTTCGCGGCTTGTCCAACGTCCGCGCAGCTTCTCTCGCTTCCCGTCCGCTGTGGCCGGGCGAGGGGAGGCCTTCCGGAGCCATGGAGGACGAGGTGGTTCGCATTGCCAAGAAGATGGACAAAATGGTGCAGAAAAAGAATGCGGTAAGCGGGCGGGCCCGGGGCAGGCCGCGGCCGAGTCCGCGACACCGGCCCACGCGGAGTCAGTCAGGGCCGGGCCGGGCCGGGCCGCGCCGCGCGCCCTCCGGGCTTTCAGGTCCGTGCTGCCCACGGCCGCAGTGAGCCCAGGACTGAAGCCCTCGACGCTGGGCTGAGAGAATGTACGGTGGCCGCCGCCGGTTACCCCAGAGCGGCTTCGCTGGGTGCCGCGGTCAGTTGCAGACGGGCCGTAAAGTGTGTTTATCCCAGCTCTGCCTTACACCCACTTGGTAATCACAAACAGAAAATACGCCGCCCCTTGGCAAGTCGGCCTGCGTGCGATGTGCCGGGCAGCCGGCTGCTGCTGCTCCGCGGGTATGGTGCTGAGGCTGCAGCGGGACGGATGACATCCTCTGTTCCTTGATTGGTCTGTGTGGATGTGTCCCTCGGTGTGTACCACTTTCCATCCCTTTGCTTCCACACTTGACCTCTCCAGCCAGCTTCACTTGAATCTCTTCTTCCAGACGCAGCAGGTTTGCATTATCAAAACACCAGAGACTTTTGCCATTTGTGAAGTAAAAGCCGTGCAGAGATGCTAAACGTTTGTGCAGTGAGTGCTTTAGAGTTACGCGCCTCTGCAGAATTCTTTTTATGAGTCAAGAAACTGAGTCAGAGCTAAAGTGCTGTCGCTTAAATACTGTGTAGTTTTGCTTGAGCTAGGAGCTCGCTGACTTTCCACTCGCTGCCAGCTTAGGATTTAGTGTTAATACCAAATAATGTAAGACTTTCAGACGCCCCCGGTTCCTCTTTCTTAAACAGGTGTGTTGACCCGCATTTAATTTCATGCCCCGTTAATCCGTATGTACGATTTTGGTCCTTTCAAATATTTCCTGTTTGTTTTCTGTTCCAGATATTAGCCTGTGTTGTTTTTCTTTCTTACTGATTGAGCTATTTCATTACTTGATGCAGGTGATGAG
>atac_peak_5::1:5153075-5153839
CTTAAAATCGGAGTATGAATGCCTGGCCCGCTGAGATGTTTAATTCAAACAAACCCAAACCACTCAAAGGTCATAAACCAAGCTTCTTCAAAGATTTTTGGCTTTTTGGCACCAGTGGCCTGCAGGGTGGCGAGCTCTGCCAGTTTGAAGTGACCAAGTTAAGTGGCCTGGGAAAGGCCATTTGGTGCGCGGTCCAGCAGTTTTGGGCGCTCTCGGCTTCCGCCCTCAGCTGCGGTCACGTGCGGCTGCTCACGTGCCAGACGCTGCTGTCACTTCGTAGCTGTTCCGGCTTCCTCTGAGTGAGGCTCGCAACGTCTCCCACGGAGTCGCCTTCGTTCTGCTCTGGGTCTCCCGTGGCCACTGAGACCTCGGAGCTCGACCGGCGCCTGCCCGCCCGTGCGGCCCTCACTCCCCGAGGCTATCCAGGTGAGGCCGCCTGGGGTCCCTCCCCGGCTCCGGAGAGCCGACTGGTTTCCCTGCCGGCCGCGCCAGGGTCTGCGGAGCCGCAGGGCCTTCCTCCCCTCGGAGGGAAGCAGAGAGCGCGGGGGTGGGCAGGCCTGGGCCGGGCCGGGCCGGGCCGGGAGAGCCGCGGGGGTCGGGCCGCCCGGGAAGGCTGCGCTGTGCCTTTGGTGCCAGCCTTGCCCGGATGGAGCCGGTTTTACCTGGGACCCGCAGTCTGCGCCGACCTTGCTTGTTCAGGCAGCGAGTTCCCACAAAGAGAAATCTAAGTTCCACACTGACTAAGCACTTCGGCAACGCGGAAT

2.3 BigWig files

The bedgraph (.bdg) output from Macs2 is the normalized enrichment across the genome. It can be useful for visualization in a genome browser, but we should make it into a bigWig file first.

  • BigWig is the binary compressed and indexed version of bedGraph or wig, kind of like BAM is to SAM.
  • We’ll use tools from the UCSC genome browser to convert.
  • One file we need is a list of chromosome sizes. Usefully, we have this already in the form of our mm39 fasta index file!
  • Note that bedGraphToBigWig is particular about how the chromosomes are sorted. The .bdg file IS sorted by chromosome, but we sort it again using bedSort (also from UCSC) to be sure.
  1. Copy your SLURM template script from your home directory
cp ~/slurm_template.sh bdgtobw.sh
  1. Edit (nano bdgtobw.sh) the SLURM script to look like the following:
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load BEDTools
bedtools sort atac_treat_pileup.bdg atac_treat_sorted.bdg 

module load ucsc
bedGraphToBigWig atac_treat_sorted.bdg mm39/mm39.fa.fai atac_treat_pileup.bw

Remember: Ctrl+O to save, Ctrl+X to exit from nano.

  1. Then submit the job
sbatch bdgtobw.sh

After it finishes, compare file sizes:

ls -lh atac_treat_pileup.*
-rw-r--r-- 1 mmaiensc domain users 489M Jun 27 16:00 atac_treat_pileup.bdg
-rw-r--r-- 1 mmaiensc domain users  80M Jun 27 16:49 atac_treat_pileup.bw

If your last step (peak calling) did not finish yet, you can link to the results below:

ln -s /workshops/ric_workshop/common/3_NGS/atac_treat_pileup.bw

2.4 IGV

For this exercise we are going to look at the files generated by MACS2.

  • First, download the following files
    • atac_nodups.bam
    • atac_nodups.bam.bai
    • atac_peaks.narrowPeak
    • atac_summits.bed
    • atac_treat_pileup.bw

2.4.1 Load the data

  1. Once you have download the files open IGV.

  2. Set the genome as mm39.

    1. From the Genomes menu select Download Hosted Genome…
    2. In the Hosted Genomes dialog, enter mm39 in the filter to quickly find the genome. Then select on “Mouse mm39”
    3. Leave Use remote sequence and Use remote annotations selected.
    4. click OK

  1. Load your files.
    1. From the File menu select Load from File…
    2. In the Select Files dialog, navigate to the folder where you downloaded files and select atac_nodups.bam, atac_peaks.narrowPeak,atac_summits.bedandatac_treat_pileup.bw`.
      • You can select all of the data at once if you want (command-click on Mac or control-click on PC)
      • DON’T select the .bai file. It just needs to be in the same folder as the .bam file.
    3. Click OK

(exercise continues on next page)

2.4.3 Viewing data

You have a large number of options for customizing the view:

  • Right-click on the alignments to:
    • Change packing (squished/collapsed/expanded). Choose squished
    • Change coloring of read pairs. Choose “first of pair strand”
      • Note the option for bisulfite mode
    • Rename the track
    • Copy a read sequence
  • Right-click on the coverage to:
    • Rescale the window (Set Data Range), or to change to log-scale
    • Change the track height
    • Rename the track
  • Right-click on the gene annotation to:
    • Change to expanded or collapsed (expanded lets you see all isoforms)
    • Copy the gene sequence
  • If you zoom in to see the genomic sequence, you can right-click to show you the translation
  • Hover the mouse over different tracks to see extra information
  • Click and drag on track names to move/reorder the tracks
  • Right-click on the track name to delete

EXAMPLE: You can also view SNPs and other variants if you zoom in

EXAMPLE For RNA-seq data mapped in a splice-aware manner (STAR or TopHat), we can also see the splice junctions

  • If you are looking at RNA-seq data you can right-click on alignments, choose Show Splice Junction Track

2.4.4 Saving images

You can save views from IGV to include in manuscripts, presentations, posters, etc.:

  • Go to File > Save Image. You can a couple file format options
    • .png (bitmap), good for a quick snapshot of a particular view
    • .svg (vector graphics), good for making custom high-quality figures. You will need to open the .svg file in Adobe Illustrator to clean it up a bit, but this allows you to modify the text, colors, and other features of the plot while retaining arbitrarily high resolution.

2.4.5 Saving sessions

You can save a particular view in IGV as a session. This is very useful if you’ve loaded a lot of data and configured the tracks in a specific way, and you want to save this configuration to return to later.

  • Go to File > Save Session
  • To load a previous session, go to File > Open Session
  • The session will be saved as an .xml file. This file has all of the data sets you’ve loaded in IGV, plus all of the view and track configurations you’ve picked, down to the genomic position you’re currently looking at.
  • IMPORTANT: The .xml file includes the folder/names of all of the data you’ve loaded. If you move any of the data sets around and try to reload the session, this will fail.

2.5 samtools faidx

The samtools faidx command allow one to quickly extract regions from a FASTA file. As part of this command, samtools will generate a FASTA index (.fai) file.

  • Unlike other file indices, the FASTA index (.fai) is plain text, and we can look at it to see chromosome size.
  • The index allows us to perform random access to genomic sequences based on coordinates.
  • If you didn’t have an index made already, you can generate one with the command:
    • samtools faidx mm39.fa
    • But this should be run on a SLURM job, as it can take some time for a big genome.
  • The second column is the chromosome length, and the third column is (mostly) the cumulative genome size.
head mm39/mm39.fa.fai
chr1    195471971   8   60  61
chr2    182113224   198729854   60  61
chr3    160039680   383878307   60  61
chr4    156508116   546585323   60  61
chr5    151834684   705701916   60  61
chr6    149736546   860067187   60  61
chr7    145441459   1012299351  60  61
chr8    129401213   1160164843  60  61
chr9    124595110   1291722751  60  61
chr10   130694993   1418394457  60  61

Let’s get the genomic coordinates that correspond to the first peak on chr12.

grep "chr12" atac_peaks.narrowPeak | head -1
module load SAMtools
samtools faidx mm39/mm39.fa chr12:3109722-3110238
>chr12:3109722-3110238
CAGCTTCCTATTTCTCTCCTCTCCCAAGAAAGGCTGGTGTCCTTATAAGGGAGACAGCCT
CCATTTCCAACAGTGTGGAGAGCTAAATAAATACTCTCTCACAGTCAATGCAGAGGTAAC
TTTTTTTTTTTTTTTTGATGGATTAAAATCACTGAAAATCACGGAAAATGAGAAATACAC
ACTTTAGGACGTGAAATATGGCGAGGAAAACTGAAAAAGGTGGAAAATTTAGAAATGTCC
ACTGTAGGACATGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCAC
TTGACGACTTGAAAAATGACAAAATCACTGAAAAAGGTGAAAAATGAGAAATGCACACTG
TAGGACTTGGAATATGGCGAGAAAACTGAAAATCACGGAAGTGAGGGGCTTTTGTCTCTC
TCTCAGTGGGAGCTGCCATTTTAATGGGACCTCCAGAGGGGCCTGATTTGATGATGTGAA
GTAGGCCTGGAAGAATCTAGGATTTTTTTTTAAATTT

2.6 Using bcftools

For this exercise we will be using a BCF file prepared using data deposited with NCBI (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA750802) In particular this BCF file is from one sample (SRA ID: SRR15301098) and after variant calling, the BCF file was subsetted to only include data from genes in chromosome 19.

NOTE: As this is a small file we can run some bcftools commands on the head node. For real data you should perform all of the work on a compute node, either through a SLURM batch script or interactive session.

  1. Link the BCF file from the workshop folder.
ln -s /workshops/ric_workshop/common/3_NGS/variants.bcf
  1. Load BCFtools.
module load BCFtools
  1. Get a count of the variants in the BCF file.
bcftools plugin counts variants.bcf
Number of samples: 1
Number of SNPs:    24364
Number of INDELs:  2272
Number of MNPs:    1445
Number of others:  403
Number of sites:   28389
  1. Subset the BCF file for only variants with a variant calling confidence of 20 (p=0.01, 99% confidence) and sequencing depth of at least 10.
    • The -i option lists the filtering criteria. We are going to filter on depth (DP) and use the data for each ssample (FORMAT)
  • The -O option sets the output type to BCF (b).
  • The -o options specifies the output file. The default action is to print to STDOUT.
bcftools view -i 'QUAL>20 && FORMAT/DP>10' -Ob -o variants_subset.bcf variants.bcf 
  1. Get counts of the subsetted BCF file.
bcftools plugin counts variants_subset.bcf
Number of samples: 1
Number of SNPs:    3007
Number of INDELs:  333
Number of MNPs:    210
Number of others:  58
Number of sites:   3599
  1. Print out the details (position, genotype and depth) for the subsetted variants. We will pipe it to head to see the first 10 lines.
    • The -H flag tells bcftools query to print a header
    • The -f option lists the fields to print. We will print the chromosome (CHROM), position (POS) and genotype (GT) for each sample. Per sample format fields, e.g. GT, need to be enclosed in square brackets to loop over all samples.
bcftools query -H -f'%CHROM\t%POS\t[%GT]' variants_subset.bcf | head
#[1]CHROM   [2]POS  [3]GT
19  4128293 0/1
19  4128312 0/1
19  5798782 0/1
19  6104861 0/1
19  6105124 0/1
19  6105217 0/1
19  6107748 0/1
19  6107755 0/1
19  6790300 0/1
  1. Convert to compressed VCF format (.vcf.gz). The compressed VCF file can then be downloaded and opened in IGV
bcftools convert -Oz -o variants.vcf.gz variants.bcf

2.6.1 BONUS/TAKE HOME: Using a filter.

The following is an example of using the VCF/BCF FILTER field to tag variants as filtered, which still retain the data in the file, and then view/query the variants that pass the filter. Using the FILTER field can be helpful to set various filters but, still be able to see the variants that did not meet the critera. Furthermore, you can apply a set of different filters one at a time and then do a final subsetting of just the variants that PASS.

  1. Use the filter command using the same criteria as step 4 in the previous steps. For this we will run in two steps. The first will filter based on the QUAL field for the variant and then based on depth.
    • The -s option sets it as a soft filter and sets the FILTER field for any variant to be “QUAL_20” if it fails to meet the first criteria (-i option) and then “DP_10” if fails to meet the second criteria.
bcftools filter -i 'QUAL>20' -s "QUAL_20" -Ou variants.bcf | \
   bcftools filter -i 'FORMAT/DP>10' -s "DP_10" -Ob -o variants_filtered.bcf - 
  1. Check the counts and you can see the number of variants is the same as the original file.
bcftools plugin counts variants.bcf
Number of samples: 1
Number of SNPs:    24364
Number of INDELs:  2272
Number of MNPs:    1445
Number of others:  403
Number of sites:   28389
bcftools plugin counts variants_filtered.bcf
Number of samples: 1
Number of SNPs:    24364
Number of INDELs:  2272
Number of MNPs:    1445
Number of others:  403
Number of sites:   28389
  1. We can use the bcftools query command to print details about the variants. We can see that the top variants have a depth less than 10. Thus the FILTER field is set to “DP_10”
bcftools query -H -f '%CHROM\t%POS\t[%GT\t%DP]\t%FILTER' variants_filtered.bcf | head
#[1]CHROM   [2]POS  [3]GT   [4]DP   [5]FILTER
19  3096362 0/1 3   DP_10
19  3096637 1/1 2   DP_10
19  3096652 1/1 2   DP_10
19  3096658 1/1 2   DP_10
19  3096695 1/1 2   DP_10
19  3122217 0/0 3   DP_10
19  3122222 0/0 3   DP_10
19  3152171 0/1 3   DP_10
19  3152187 1/1 2   DP_10
  1. Use bcftools view to subset the varaints based on the filter status and then pipe to bcftools plugin counts to see the filtered results.
bcftools view -f .,PASS variants_filtered.bcf | bcftools plugin counts
Number of samples: 1
Number of SNPs:    3007
Number of INDELs:  333
Number of MNPs:    210
Number of others:  58
Number of sites:   3599
  1. Use the same bcftools view command but, pipe to bcftools query to print the details.
bcftools view -f .,PASS variants_filtered.bcf | \
  bcftools query -H -f '%CHROM\t%POS\t[%GT\t%DP]\t%FILTER' - | head
#[1]CHROM   [2]POS  [3]GT   [4]DP   [5]FILTER
19  4128293 0/1 13  PASS
19  4128312 0/1 14  PASS
19  5798782 0/1 17  PASS
19  6104861 0/1 29  PASS
19  6105124 0/1 152 PASS
19  6105217 0/1 101 PASS
19  6107748 0/1 39  PASS
19  6107755 0/1 40  PASS
19  6790300 0/1 20  PASS

2.6.2 BONUS/TAKE HOME: Open VCF file in IGV

  1. Download the VCF file (variants.vcf.gz) to your computer using WinSCP or Cyberduck.

  2. Open IGV and select mm39 as the genome. This was the genome used to create the original BCF file.

  3. Load the compressed VCF file (variants.vcf.gz)

    1. From the File menu select Load from File…
    2. In Select Files dialog, find and select the downloaded VCF file and click Open

NOTE: The original BCF was created only using data for chromosome 19. So, you would only see variants in that chromosome.

  1. You can then navigate to the genomic location of interest. For example, find the gene “Prune2” (Hint: use the location/selection box in the top toolbar). You can then inspect individual variants by clicking on the bars in the variants.vcf.gz track and the sample (unknown) sub-track. If the VCF file had data for more than one sample, one would see multiple sub-tracks for variants.vcf.gz.

3 Bonus/Take home Exercises

3.1 Peak enrichment for ATAC/ChIP-seq

This exercise will use BEDTools to generate a set of merged peaks from four different samples and then quantitate the read coverage for each peak in each sample.

For this example, we have aligned data from four different mouse ATAC-seq samples, sorted, removed PCR duplicates and then subsetted to only include alignments from chromosome 19.

3.1.1 Perform the initial peak calling

  1. Download the ZIP file of BAM and associated .bai files from the following URL. and then uncompress. The individual BAM and .bai files are also available on Lakeshore at /workshops/ric_workshop/common/3_NGS/multi_atac
curl -OJL https://wd.cri.uic.edu/NGS/atac_bams.zip
unzip atac_bams.zip
  1. Create a SLURM script (nano peakcall_multi.sh) to run peak calling (MACS2) on the downloaded BAM files.

NOTE: This script will run peak calling on all BAM files that end with _chr19.bam. So, be careful if you have other BAM files in this directory. Furthermore, we have adjusted the genome size to just be the size of chromosome 19 in mm39 as the BAM files have been subsetted to just chromosome 19.

For each BAM file, this script will set the variable $sample to the value of the name of BAM file without the “.bam” at the end.

#!/bin/sh
#SBATCH --mem=10G
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load MACS2

for i in *_chr19.bam; do
        sample=$(basename $i .bam)
        macs2 callpeak -t $i -f BAM -n ${sample} -g 61420004 -B --SPMR --nomodel --nolambda
done
  1. Submit the script (sbatch peakcall_multi.sh)

3.1.2 Merge the peaks and quantitate

Once the peak calling is complete you should see a set of files for each sample.

  1. You can check the number of peaks in each of the samples.
wc -l *.narrowPeak
    15 A.1_chr19_peaks.narrowPeak
    20 A.2_chr19_peaks.narrowPeak
   708 B.1_chr19_peaks.narrowPeak
   779 B.2_chr19_peaks.narrowPeak
  1522 total
  1. Create a SLURM script (nano quant_atac.sh) to merge the peaks and then requantitate for each sample.
    • After loading BEDtools, the first line will perform the following tasks and save to merged_peaks.bed
      • First it will concatenate (cat) all of the .narrowPeak files into one BED file
      • Next the cut command will just report the first three columns, i.e. chromosome, start and end positions.
      • The bedtools sort command will sort the combined BED data by chromosome and start position
      • The bedtools merge will merge overlapping regions listed in the BED data
      • The awk command will add an extra column to the merged output. This column will have the string “atac_peak_” followed by the line number in the output. This will give a nice name to all of the merged peaks.
    • The next three lines, starting with files= will create a header line for the counts table.
      • first the list of BAM files is save to the variable files as an array.
      • We set the BASH variable IFS (internal field separator) to a tab (\t). This way when we print out the array files it will automatically use a tab to separate the values.
      • We use echo to create a header line. The first four columns will come from the merged peaks BED file and then the BAM files
    • Run bedtools multicov to compute the read coverage in the list of BAM files for each region in the supplied BED file.
#!/bin/sh
#SBATCH --mem=10G
#SBATCH --partition=batch
#SBATCH --account=ric_workshop
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load BEDTools

cat *.narrowPeak | cut -f1-3 \
    | bedtools sort | bedtools merge \
    |  awk -F "\t" 'BEGIN { OFS="\t" } { print $0, "atac_peak_" NR }' > merged_peaks.bed

files=(*_chr19.bam)
IFS="\t"
echo -e "CHROMO\tSTART\tEND\tNAME\t${files}" > atac_quant.txt

bedtools multicov -bams *_chr19.bam -bed merged_peaks.bed >> atac_quant.txt
  1. Submit the job to the queue (sbatch quant_atac.sh)

  2. After the job is complete you can check the output file (atac_quant.txt)

head atac_quant.txt
CHROMO  START   END NAME    A.1_chr19.bam
19  3373242 3373467 atac_peak_1 3   0   2   13
19  3384176 3384376 atac_peak_2 0   2   0   6
19  3410001 3410201 atac_peak_3 0   0   0   5
19  3438700 3438900 atac_peak_4 0   1   0   4
19  3439226 3439440 atac_peak_5 0   0   8   8
19  3473089 3473382 atac_peak_6 0   0   10  4
19  3480788 3481103 atac_peak_7 0   1   10  12
19  3501776 3502147 atac_peak_8 0   2   10  12
19  3625743 3626379 atac_peak_9 4   18  34  36

You can then perform differential analysis on the counts in this file using edgeR. Note, when you read in the table to R for analysis with edgeR you would want to strip off the first three columns so that that the feature ID would be the peak names given. After analysis you can then merge back the first three columns, i.e. chromosome, start and end, to get a sense of the genomic locations for each of the peaks.

atac_table <- read.delim("atac_quant.txt")

atac_data <- atac_table[,-(1:3)]
atac_peaks <- atac_table[,c(4,1:3)]

#
# Run differential analysis using edgeR....
# For this we are skipping the lines you would run for a normal edgeR analysis 
# and jumping ahead to when one gets the diff table.
#
diff = stats$table
# Merge back the details of the peaks, i.e. chromosome, start and end
diff <- merge(atac_peaks, diff, by.x=1, by.y=0)
# Write to a file
write.table(diff, "diff.txt", sep="\t", quote=F

3.2 UCSC genome browser

The following are example exercises using the UCSC genome browser.

For this example, We’ll take a look at the TP53 locus in hg38 (most recent human reference genome version).

  • Open a web browser and navigate to genome.ucsc.edu
    • From the Genomes menu, select the human hg38 genome
  • Type “TP53” into the search bar, and click the first link that comes up
  • Zoom out 3x so we can see the whole window (~57kb)
  • Shift-click and drag to select a window
  • Click on transcript annotations to see more information

3.2.1 UCSC genome browser: tracks

  • Scroll down to the tracks to see all of the options to show. A couple to try:
    • Mapping and Sequencing > GC Percent > dense
    • Mapping and Sequencing > Mappability > show. Mappability measures whether how well a short Illumina read could be mapped to a specific region of the genome. Some regions, for example with repetitive elements, will not be uniquely mappable with short-read data. This is even more of an issue with bisulfite data, as the conversion of unmethylated C to T (when sequenced) reduced the complexity of genomic sequence.
    • Variation > Common SNPs (151) > full
  • Click “refresh”
  • We don’t usually load our own data in the UCSC genome browser: to do this, you’d first need to upload it to a public-facing web server.

3.3 UCSC table browser

Download a gene annotation file for hg38:

  • Click on the Tools menu, select “Table Browser”
  • Selections:
    • clade Mammal, genome Human, assembly Dec. 2013 (GRCh38/hg38)
    • group Genes and Gene Predictions, track Gencode v29
    • table knownGene
  • Select a filter so that the output is not too big (just for this exercise)
    • Click on filter, then select chrom does match “chr22”, hit submit
  • output format GTF - gene transfer format (limited)
  • ouput file “hg38_gencode_chr22.gtf”
  • Click “get output”

Table browser:

Setting filters:

3.4 UCSC genome browser: liftOver tool

The liftOver tools allows us to convert coordinates between assemblies of the same organism (e.g., mm39 to mm9), or even between organisms based on homology (e.g., mm39 to hg38). We can try it on our ATAC-seq peaks.

  • Click on the Tools menu, select “LiftOver”
  • Selections:
    • Original Genome: Mouse
    • Original assembly: mm39
    • New genome: Human
    • New assembly: Dec. 2013 (GRCh38/hg38)
  • Choose File and browse to the atac_summits.bed file
  • Click “Submit”
  • It will take ~30 seconds to finish. Note that not all regions will be converted, as not every region will have a perfectly homologous match.

3.5 Example variant calling

NOTE For this exercise we have removed the parition/queue and accounting details from the SLURM scripts. For a short time after the end of the workshop you should still be able to use the ric_worshop accouting code with the batch partition. However, if you plan to run these exercise later be sure to set the proper partition and accounting code for your HPC account.

For this example, we will be using data deposited with NCBI (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA750802) These data were obtained from mouse samples. So, we will be using the mm39 reference in this example.

As a test case, you can use the following SLURM script to download the data for one of the samples (SRA ID: SRR15301098) and create a subset of the data. This script will subset the first 1 million read (4 million FASTQ lines).

#!/bin/sh
#SBATCH --time=10:00:00
#SBATCH --mem=10G
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load SRA-Toolkit

fasterq-dump SRR15301098 -O raw_data -p --split-3

for a_file in raw_data/*.fastq ; do
        sample=$(basename $a_file .fastq)
        echo "$sample"
        head -n 4000000 $a_file | gzip -c > ${sample}.fastq.gz
done

3.5.1 Alignment

For the genome alignment we can use BWA-MEM and thus can use the same genome index used in the Genome Alignment exercise (1.x). In the alignment step, we will also sort, remove any PCR duplicates and index the resulting BAM file.

  1. Create a SLURM script (nano mapping_vc.sh) that looks like the following.
#!/bin/sh
#SBATCH --mem=10G
#SBATCH --ntasks-per-node=4
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load BWA
module load SAMtools

bwa mem -t $SLURM_TASKS_PER_NODE mm39/mm39.fa SRR15301098_1.fastq.gz SRR15301098_2.fastq.gz \
  | samtools view -b - > var_calling.bam

samtools sort var_calling.bam > var_calling_sort.bam
samtools rmdup var_calling_sort.bam var_calling_nodups.bam
samtools index var_calling_nodups.bam

Remember: Ctrl+O to save, Ctrl+X to exit from nano.

  1. Submit the script to SLURM queue.
sbatch mapping_vc.sh

3.5.2 Variant callling using freebayes

We will use freebayes (https://github.com/freebayes/freebayes) to call the variants from the alignment files. We will also use bcftools to convert the output from freebayes (VCF) to a compressed BCF file

  1. Create a SLURM script (nano freebayes.sh) that looks like the following.
#!/bin/sh
#SBATCH --mem=10G
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load freebayes
module load BCFtools

freebayes --fasta-reference mm39/mm39.fa var_calling_nodups.bam \
  | bcftools -o variants.bcf -O b -

Remember: Ctrl+O to save, Ctrl+X to exit from nano.

sbatch freebayes.sh

3.5.3 Variant callling using bcftools

bcftools has the ability to call variants from an alignment file. This may not be a sophisticated or robust as freebayes. However, depending on the goal of your project it may work for you.

  1. Create a SLURM script (nano bcftools_call.sh) that looks like the following.
#!/bin/sh
#SBATCH --mem=10G
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load BCFtools

bcftools mpileup -Ou -f mm39/mm39.fa var_calling_chr19.bam | \
        bcftools call -Ob -mv > variants.bcf

Remember: Ctrl+O to save, Ctrl+X to exit from nano.

sbatch bcftools_call.sh

3.6 Processing Methylation (RRBS) data using Bismark

NOTE For this exercise we have removed the parition/queue and accounting details from the SLURM scripts. For a short time after the end of the workshop you should still be able to use the ric_worshop accouting code with the batch partition. However, if you plan to run these exercise later be sure to set the proper partition and accounting code for your HPC account.

For this example, we will be using data deposited with NCBI (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1252737) These data were obtained from mouse samples. So, we will be using the mm39 reference in this example.

As a test case, you can use the following SLURM script to download the data for one of the samples (SRA ID: SRR33213504) and create a subset of the data. This script will subset the first 1 million read (4 million FASTQ lines).

#!/bin/sh
#SBATCH --time=10:00:00
#SBATCH --mem=10G
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load SRA-Toolkit

fasterq-dump SRR33213504 -O raw_data -p --split-3

for a_file in raw_data/*.fastq ; do
        sample=$(basename $a_file .fastq)
        echo "$sample"
        head -n 4000000 $a_file | gzip -c > ${sample}.fastq.gz
done

3.6.1 Build Bismark index

Like other alignment tools, Bismark needs its own index and this will only need to be run once per reference.

  1. Bismark likes to have the reference file in its own directory. So, create a directory for the FASTA file.
mkdir bismark_ref_mm39
  1. Download the approriate genome sequence (FASTA) file. For this example, the data were from mouse so, download the FASTA file for mm39 (https://ftp.ensembl.org/pub/current/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.toplevel.fa.gz). We also need to save as uncompressed FASTA for bismark.
cd bismark_ref_mm39
curl https://ftp.ensembl.org/pub/current/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.toplevel.fa.gz | zcat > mm39.fa
  1. Create a SLURM script (nano index.sh) that looks like the following.
#!/bin/sh
#SBATCH --time=100:00:00
#SBATCH --mem=30G
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load Bismark

bismark_genome_preparation bismark_ref_mm39
  1. Submit the script to SLURM queue.
sbatch index.sh

3.6.2 Run Bismark on a single sample.

  1. Create a SLURM script (nano run_bismark.sh) that looks like the following.
    • The first bismark command will run the alignment and generate a BAM file.
    • The bismark_methylation_extractor command will extract a methylation report from the BAM file produced by Bismark.
    • The --comprehensive option will have the command merge strand specific methylation data for each context, e.g. CpG, CHG, CHH, into a single file for each context.
    • The --gzip will compress the output using gzip. The table can be very large and it is recommended to have it compressed. It is possible to gzipped table directly into R.
#!/bin/sh
#SBATCH --time=100:00:00
#SBATCH --mem=30G
#SBATCH --ntasks-per-node=4
#SBATCH -o %x.o%J

cd $SLURM_SUBMIT_DIR

module load Bismark

bismark -p 4 --genome bismark_ref_mm39 -1 SRR33213504_1.fastq.gz -2 SRR33213504_2.fastq.gz
bismark_methylation_extractor --gzip --comprehensive -p SRR33213504_1_bismark_bt2_pe.bam
  1. Submit the script to SLURM queue.
sbatch run_bismark.sh

3.6.3 per-loci methylation report.

The output from bismark_methylation_extractor is a per sequence report with the following columns. seq-ID, methylation state, chromosome, start position (= end position), methylation call. You can use the following R code to summarize the report to create counts per genomic loci.

library(dplyr)

me_full_report <- read.delim(gzfile("CpG_context_SRR33213504_1_bismark_bt2_pe.txt.gz"), skip=1, 
                             col.names=c("sequence", "me_state", "chromosome", "posn", "me_call"))


me_counts <- subset(me_full_report, me_state == "+") %>% 
  group_by(chromosome, posn) %>%
  summarize(me_count=n())

all_counts <- me_full_report %>% 
    group_by(chromosome, posn) %>%
    summarize(all_count=n())

me_report <- full_join(all_counts, me_counts, by=join_by(chromosome, posn)) %>%
    mutate(me_count=coalesce(me_count, 0)) %>%
    mutate(me_fraction=me_count / all_count)

After creating the me_report data.frame you may want to filter the loci to only retain those sites with a minimum level of coverage.